Exploration of Vancouver Trees

Author : Muntakim Rahman   UBC Student Number : 71065221

Introduction

This notebook conducts an Exploratory Data Analysis (EDA) of the Vancouver Trees dataset stored in the small_unique_vancouver.csv file.

Import Packages

Observe Outputs

Let's start by getting an understanding of the data sparsity (i.e. NULL values), as well as the column distributions.

Data Sparsity

There are NULL occurrences in the date_planted, plant_area, and cultivar_name columns. Let's keep these rows for now so that we can still visualize their non-NULL entries.
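As a minimal sketch of how such a sparsity check could be done with pandas, the snippet below counts missing entries per column. The rows are hypothetical stand-ins (the values are invented for illustration); the notebook itself would run this on the frame loaded from small_unique_vancouver.csv.

```python
import pandas as pd

# Hypothetical stand-in rows, NOT taken from small_unique_vancouver.csv.
trees = pd.DataFrame({
    "genus_name": ["ACER", "PRUNUS", "ACER", "QUERCUS"],
    "date_planted": ["1999-01-13", None, None, "2004-11-02"],
    "plant_area": ["10", "B", None, "7"],
    "cultivar_name": [None, "KWANZAN", None, None],
})

# Count missing entries per column and express them as a fraction of all rows.
null_counts = trees.isna().sum()
null_summary = pd.DataFrame({
    "null_count": null_counts,
    "null_fraction": null_counts / len(trees),
})

# Keep only the columns that actually contain missing values.
null_summary = null_summary[null_summary["null_count"] > 0]
print(null_summary)
```

Reporting the fraction alongside the raw count makes it easier to judge whether a column is too sparse to plot.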

Non-Numeric Data

Observing the data stored as objects, there seems to be variation in the number of distinct values across columns.

The std_street and on_street columns each have more than 600 distinct values and would not be good candidates for the EDA.

The date_planted column has only 1599 distinct values in the entire dataset. This implies that many trees share a planting date, which is rather interesting.

The curb and root_barrier columns are binary in nature and should be one-hot encoded in our final analysis.
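The distinct-value counts and the one-hot encoding mentioned above could be sketched as follows. The rows and the "Y"/"N" codes are assumptions made for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in rows; the "Y"/"N" codes are assumed, not confirmed.
trees = pd.DataFrame({
    "std_street": ["W 13TH AV", "MAIN ST", "W 13TH AV", "OAK ST"],
    "curb": ["Y", "N", "Y", "Y"],
    "root_barrier": ["N", "N", "Y", "N"],
    "diameter": [3.0, 12.5, 8.0, 20.0],
})

# Number of distinct values in each non-numeric column, highest first.
distinct_counts = (
    trees.select_dtypes(exclude="number")
    .nunique()
    .sort_values(ascending=False)
)

# One-hot encode the two binary columns; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(trees, columns=["curb", "root_barrier"], dtype=int)
print(distinct_counts)
print(encoded.columns.tolist())
```

For a truly binary column, one of the two indicator columns is redundant, so passing drop_first=True to pd.get_dummies is an option worth considering.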

Numeric Data

Observing the data stored as type np.number, there seem to be large differences in standard deviation across columns.

Given its standard deviation of 75412.260406, the tree_id column is probably a unique identifier. We can use this to identify our trees, but it doesn't serve much other use for our EDA.

There is a very large standard deviation for the civic_number column, with the min value being 2 and the max being 9113. There is similar behavior in the on_street_block column, which has very similar mean, min, and max values to civic_number. I'm not particularly interested in these columns, but we can visualize the correlation.

The height_range_id column has a mean, a 25th percentile, and a 50th percentile all close to 2, which is interesting. I'd like to see the distribution of this column.

The latitude and longitude columns have a standard deviation of less than 0.1, which suggests that most trees are in the same vicinity. We can try using this data to see where trees are densely concentrated.
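A sketch of how the numeric summary above might be produced is shown below, again on hypothetical rows. The "identifier-like" check is a rough heuristic of my own (an integer column whose values are all distinct), not something taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; only the column names come from the dataset.
trees = pd.DataFrame({
    "tree_id": [1001, 20544, 130877, 250012],
    "height_range_id": [2, 2, 1, 4],
    "latitude": [49.25, 49.26, 49.24, 49.27],
    "longitude": [-123.10, -123.12, -123.08, -123.15],
})

numeric = trees.select_dtypes(include=np.number)

# Transposed describe(): one row per column with count, mean, std, min, quartiles, max.
summary = numeric.describe().T

# Heuristic (an assumption): integer columns with no repeated values look like identifiers.
id_like = [
    col for col in numeric.columns
    if pd.api.types.is_integer_dtype(numeric[col]) and numeric[col].is_unique
]
print(summary[["mean", "std", "min", "50%", "max"]])
print(id_like)
```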

Questions of Interest

We want to explore this dataset to understand :

Columns of Interest

We are going to be visualizing the data in the following columns :

Exploratory Visualizations

Q1 : What trees are commonly found in Vancouver?

Let's plot the count of each genus_name to visualize the most and least common trees within the city.
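The counts behind such a plot could be computed along these lines. This is a sketch on made-up genus values; the resulting table can be handed to whichever plotting library the notebook uses.

```python
import pandas as pd

# Hypothetical stand-in values for the genus_name column.
trees = pd.DataFrame(
    {"genus_name": ["ACER", "PRUNUS", "ACER", "QUERCUS", "ACER", "PRUNUS"]}
)

# value_counts() sorts from most to least common by default.
genus_counts = (
    trees["genus_name"]
    .value_counts()
    .rename_axis("genus_name")
    .reset_index(name="tree_count")
)

most_common = genus_counts.head(1)
least_common = genus_counts.tail(1)
print(genus_counts)
```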

From `Figure 1` :

Q2 : Where are trees located in Vancouver?

Let's bin the latitude and longitude coordinates in a heatmap to visualize the tree density within a given area.
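One way to sketch the binning step is with np.histogram2d, which returns a grid of counts plus the bin edges. The coordinates and the choice of two bins per axis are assumptions for illustration; a real heatmap would use many more bins.

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates: three trees in one corner, two in the opposite corner.
trees = pd.DataFrame({
    "latitude": [49.21, 49.22, 49.22, 49.28, 49.29],
    "longitude": [-123.20, -123.19, -123.18, -123.05, -123.04],
})

# counts[i, j] is the number of trees in latitude bin i and longitude bin j.
counts, lat_edges, lon_edges = np.histogram2d(
    trees["latitude"], trees["longitude"], bins=2
)

# Locate the densest cell and recover its coordinate ranges from the edges.
row, col = np.unravel_index(counts.argmax(), counts.shape)
densest_lat_range = (lat_edges[row], lat_edges[row + 1])
densest_lon_range = (lon_edges[col], lon_edges[col + 1])
print(counts)
print(densest_lat_range, densest_lon_range)
```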

From `Figure 2` :

Q3 : What sizes are Vancouver trees? Is there a relationship between diameter, height range ID and plant area?

Let's explore the plant_area column. The values in this column are stored as objects, which is interesting.
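To see why a column like this ends up stored as objects, a quick check is to try converting it to numbers and list whatever fails. The sketch below assumes the column mixes numeric strings with letter codes; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical values: an assumed mix of numeric strings, letter codes, and a NULL.
plant_area = pd.Series(["10", "B", "7", None, "N", "10"], name="plant_area")

# errors="coerce" turns anything that is not a number into NaN.
as_number = pd.to_numeric(plant_area, errors="coerce")

# Values that were present but failed to parse are the non-numeric codes.
non_numeric = plant_area[as_number.isna() & plant_area.notna()]
non_numeric_values = sorted(non_numeric.unique())
numeric_values = as_number.dropna().tolist()
print(non_numeric_values)
print(numeric_values)
```

Splitting the column this way lets the numeric part be treated as a measurement and the codes as categories.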

From `Figure 3` :

Let's look into the relationship between the diameter, height_range_id, and plant_area columns.
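A simple numeric complement to the chart is a correlation matrix. The sketch below uses the default Pearson correlation on hypothetical rows and covers only the two numeric columns, since plant_area is stored as objects.

```python
import pandas as pd

# Hypothetical stand-in rows where taller height ranges go with larger diameters.
trees = pd.DataFrame({
    "diameter": [3.0, 6.0, 12.0, 20.0, 30.0],
    "height_range_id": [1, 2, 3, 4, 6],
})

# Pairwise Pearson correlation between the numeric size columns.
corr = trees[["diameter", "height_range_id"]].corr()
diameter_vs_height = corr.loc["diameter", "height_range_id"]
print(corr)
```

Because height_range_id is a binned, ordinal quantity, a rank-based correlation may be the better choice for the real data.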

From `Figure 4` :

Q4 : What neighborhoods have the largest trees? What about the smallest trees?

Let's look at the breakdown of this data for both diameter and height_range_id by neighbourhood_name.
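The per-neighbourhood breakdown could be sketched with a groupby. The rows are hypothetical, the column is spelled neighbourhood_name as it is later in this notebook, and using the mean as the size summary is my assumption (a median would be less sensitive to outliers).

```python
import pandas as pd

# Hypothetical stand-in rows for two neighbourhoods.
trees = pd.DataFrame({
    "neighbourhood_name": ["KITSILANO", "KITSILANO", "SUNSET", "SUNSET", "SUNSET"],
    "diameter": [10.0, 20.0, 4.0, 6.0, 8.0],
    "height_range_id": [3, 5, 1, 2, 2],
})

# Average size per neighbourhood, ordered from largest to smallest mean diameter.
size_by_area = (
    trees.groupby("neighbourhood_name")[["diameter", "height_range_id"]]
    .mean()
    .sort_values("diameter", ascending=False)
)

largest = size_by_area.index[0]
smallest = size_by_area.index[-1]
print(size_by_area)
```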

From `Figure 4` :

Q5 : How did tree sizes change by decade?
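A decade can be derived from date_planted by parsing the dates and flooring the year to a multiple of ten. The sketch below does this on hypothetical rows; the ISO date format and the derived column name decade_planted are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in rows; the date format is assumed to be ISO (YYYY-MM-DD).
trees = pd.DataFrame({
    "date_planted": ["1994-03-01", "1998-11-20", "2005-06-15", None],
    "diameter": [20.0, 16.0, 6.0, 9.0],
})

# Parse dates; missing or unparseable entries become NaT.
planted = pd.to_datetime(trees["date_planted"], errors="coerce")

# Floor the year to its decade; the nullable Int64 dtype keeps missing values as <NA>.
trees["decade_planted"] = (planted.dt.year // 10 * 10).astype("Int64")

# Mean diameter per decade, ignoring trees without a planting date.
diameter_by_decade = (
    trees.dropna(subset=["decade_planted"])
    .groupby("decade_planted")["diameter"]
    .mean()
)
print(diameter_by_decade)
```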

From `Figure 5` :

Concluding Remarks

I would like to explore the data in these charts when filtered for criteria including :

A few questions start to emerge when looking at data for the columns we've considered for size, as well as trends over the decades.

Do trees of the same genus_name have similar numerical features? Do trees in the same neighbourhood_name tend to share the same genus_name values? Where are more trees being planted over the decades? Has the tree density by neighbourhood_name changed over the decades?

Interactive Dashboard

Let's create a dashboard from the visuals above in order to start investigating these questions. This enables us to consider these data insights adjacent to one another. We're going to be filtering our charts with the neighbourhood_name, genus_name, and decade_planted fields.
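Independent of the dashboard framework, the filtering logic can be sketched as a plain function over the three fields. filter_trees is a hypothetical helper, not part of the notebook, and the rows are invented; each argument left as None means "no filter on that field".

```python
import pandas as pd

def filter_trees(trees, neighbourhood=None, genus=None, decade=None):
    """Return the rows matching every filter that is not None (hypothetical helper)."""
    mask = pd.Series(True, index=trees.index)
    if neighbourhood is not None:
        mask &= trees["neighbourhood_name"] == neighbourhood
    if genus is not None:
        mask &= trees["genus_name"] == genus
    if decade is not None:
        # Trees with a missing decade never match a decade filter.
        mask &= trees["decade_planted"].eq(decade).fillna(False).astype(bool)
    return trees[mask]

# Hypothetical stand-in rows, including one tree with no planting decade.
trees = pd.DataFrame({
    "neighbourhood_name": ["KITSILANO", "KITSILANO", "SUNSET", "SUNSET"],
    "genus_name": ["ACER", "PRUNUS", "ACER", "ACER"],
    "decade_planted": pd.array([1990, 2000, 1990, None], dtype="Int64"),
})

acer_1990s = filter_trees(trees, genus="ACER", decade=1990)
sunset_acer_1990s = filter_trees(trees, neighbourhood="SUNSET", genus="ACER", decade=1990)
print(acer_1990s)
```

Keeping the filter separate from the charts means every visual in the dashboard can be redrawn from the same filtered frame.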

References

These resources provide the data, theory, and code segments for the EDA in this notebook.